This is a redirector for squid that intercepts advertising (banners, popup windows, flash animations, etc), page counters and some web bugs (as found). This has both aesthetic and bandwidth benefits. It's also easy to install. Note: you can use Apache instead of Squid if you like.
It remains strongly my desire that this code is not used for censorship. By this, I mean that while it could be adapted to block all sorts of content, I wish that it not be so used without parallel provision of unblocked browsing. Feel free to protect yourself from stuff with this code; do not force your blinkered view onto others without their consent. There are instructions in this page for easy provision of zapped and unzapped browsing to users; please use them.
Previous licensing:
Until Wednesday 26may1999, this code was free for use by all. However, the Australian Government brought in some truly stupid and invasive legislation, so this code is now free except that it MAY NOT be used to enforce or support that legislation or other legislation of similar intent. I'm happy for people to use it to filter their own browsing, but not for people to force their morals onto others.
Ad zapping is not a new idea. Basically you interpose between the reader and the web some kind of filter which replaces those annoying ad banners with something unobtrusive. (There are a few motivations for this; see this digression for mine.)
I first came across it at my ISP (Zip World - www.zip.com.au) a few years ago. Their technique was to use a complicated proxy.pac file. They supplied two: one which zapped ads and one which didn't. The zapping one was, I discovered, a piece of JavaScript which told your browser to go to one proxy for URLs matching known ad patterns and to the main proxy for everything else. The former proxy simply returned a placeholder GIF for everything asked of it. Initially I copied this for use at our site.
This method is a bit cumbersome. Firstly, you have to run a special web server to serve the placeholder GIF. Secondly, JavaScript interpreters are slow and (in Netscape at least) a tad buggy - eventually the browser gets flakey and may fall over. Thirdly, not all browsers support JavaScript and those that do needn't support proxy.pac files. Finally, the file was a pain to maintain and the size was making me fear for the sanity of the JavaScript interpreters.
Enter squid, arguably the best web proxy around. One great feature is the redirector. This is a program which reads request information on its input and writes (possibly) redirected information on its output. If activated, squid will consult it for every request, permitting easy interception of ads. All you have to do to activate it is place the line:
redirect_program /home/marshall/bin/squid_redirect
in your squid.conf file. Obviously, that pathname should be replaced by wherever you install the redirection program.
Attempt number 1 was a shell script. Short and effective, it was a simple while loop with a case statement. However, it seemed to have some scaling problems. Now it is a perl script called squid_redirect. In particular, because the expressions are compiled when the script starts, the redirector runs quite efficiently.
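To make the "while loop with a case statement" idea concrete, here is a minimal sketch of such a redirector. It is not the original script: the function names and the single pattern are illustrative assumptions, and it assumes squid passes one request per line with the URL as the first word (followed by client address, ident and method), answering each with either a replacement URL or a blank line for "leave it alone".

```shell
# Placeholder image; this is the default ad.gif location under $ZAP_BASE.
STUB=http://adzapper.sourceforge.net/zaps/ad.gif

redirect_one() {
    # $1 is the requested URL. Print the replacement URL,
    # or an empty line to leave the request unchanged.
    case "$1" in
        *-banner.gif) echo "$STUB" ;;   # illustrative pattern only
        *)            echo ;;
    esac
}

redirect_loop() {
    # One request per input line; only the first word (the URL) is examined,
    # the remaining words are read into "rest" and ignored.
    while read -r url rest
    do
        redirect_one "$url"
    done
}
```

A script built this way would end with a bare call to redirect_loop so that it serves requests from standard input. Every request costs a pass through the case statement, which is one plausible reason a long pattern list scales poorly in shell.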
Microsoft Windows users should read the notes for Windows users below.
Smoothwall Firewall users may want to see Martin Pot's Smoothwall Ad Zap Installation Instructions.
There's also a less wordy quick'n'dirty installation kit here by Gaute Lund, with this readme file.
Note: you can also use Apache instead.
Note (a minor security remark): of course your proxy (squid or apache) should not be available to the internet at large. Generally your proxy will be ok automatically, simply by being inside your firewall. However, if you install a proxy on some public machine you should make sure it has some sort of access control. If you're installing on a personal machine such as a laptop that is sometimes on a public net, probably your proxy should listen only on the local interface (127.0.0.1).
emerge adzapper
Note 1: The script must be executable. Run the command:
chmod a+rx the-script
when it's in place.
Note 2: the first line of the script says:
#!/usr/bin/perl
You may want to change this to:
#!/usr/local/bin/perl
or suchlike if your perl isn't in /usr/bin. (Or put a symlink in /usr/bin - this may save you hassle with other perl scripts, many of which also expect a /usr/bin/perl.)
Note 3: If you used a Windows box to fetch the script (eg via Internet Explorer) and then transferred it to the machine running your squid proxy then it's possible for the script to end up on your proxy in DOS text mode, which means it ends every line with a CR and a NL character (instead of just NL). If you suspect this, see this troubleshooting section.
redirect_program /path/to/squid_redirect
into the squid.conf file.
kill -1 pid-of-squid
You should also do this after you've updated the script; squid starts new instances of the redirector.
squid -k reconfigure
to do the same thing.
redirect_program C:/perl/bin/perl.exe c:/squid/etc/adzapper.pl
adjusting C:/perl/bin/perl.exe and c:/squid/etc/adzapper.pl to match your own install locations. This tip was obtained from the BannerFilter page. You will need SquidNT and ActivePerl or other versions of Squid and Perl for Windows, for example you might run both under Cygwin. It is also mentioned in this thread from the squid-users mailing list.
Alternatively, you can also use adzapper with Apache2. This has the advantage of being IPv6 compatible. To do this, make Apache2 load mod_proxy and mod_rewrite and configure it as follows:
ProxyRequests On
RewriteEngine On
RewriteLock /var/lock/apache2/rewrite-adzapper
RewriteMap adzap prg:/usr/bin/adzapper.wrapper
<Proxy *>
Order deny,allow
Deny from all
Allow from localhost
RewriteRule ^proxy:(.*)$ proxy:${adzap:$1|$1} [L]
</Proxy>
Also, edit the new "ZAP_CHANGE_VALUE" configuration variable and set it to NULL:
ZAP_CHANGE_VALUE="NULL"
Simply tell wrapzap the full install path of squid_redirect and tell the squid.conf file the full path of the wrapzap script instead of the zapper. Then modify wrapzap to suit. Remember that all scripts should have public read/execute permissions:
chmod a+rx scripts...
The patterns in $ZAP_PREMATCH are consulted before the main pattern list and the patterns in $ZAP_POSTMATCH afterwards. Generally you use the latter to add extra patterns and only use the former to correct overzapping by some erroneous patterns in the main pattern file. If you find such, tell me! That way your $ZAP_PREMATCH file can usually be empty and stay that way.
Finally, you can have squid_redirect ignore its inbuilt pattern list completely and use your own by defining the environment variable $ZAP_MATCH.
CLASS pattern
The CLASS specifies the type of object the pattern recognises.
The special class PASS means that URLs matching the pattern
should not be redirected i.e. they should be left alone and not zapped.
It is used to insert exceptions for general rules.
For example, this snippet from the pattern list:
PASS http://(www*.|)mozilla.org/**-banner.gif
AD http://**-banner.gif
means zap everything ending in -banner.gif except for things at mozilla.org.
The pattern more resembles a Bourne shell glob than a regular expression. In fact it is shorthand for a regular expression with the following differences:
CLASS pattern replacement
This rewrites a URL from one form to another. The syntax for the pattern is exactly as for the first type of line. The replacement is a perl string as would be found inside double quotes. In particular, the values $1, $2 etc match the bracketed substrings in the pattern as with perl. For example, this rule:
PRINT http://(www*.|)smh.com.au(/articles/**.html) http://www.smh.com.au/cgi-bin/common/popupPrintArticle.pl?path=$2
replaces an article with its "printer-friendly" version. The $2 comes from the pattern's second bracketed section.
Note: the PRINT rules are off by default. You need to turn them on with the wrapzap script.
The $ZAP_MODE variable can be set to the word "CLEAR" to cause the zapper to use "clear" versions of the replacement images and text. This will mean the ads just "vanish" from your pages. The only real downside to this is that if the zapper, through some mischance, replaces some useful markup on the page then it's not very apparent.
The $ZAP_BASE variable can be set to point to a web directory containing your own versions of the replacement images. Place files named ad.gif, adbg.gif, ad.swf, closepopup.html, counter.gif, no-op.html, no-op.js, and webbug.gif there. If you're using the "CLEAR" mode then you need files named x-clear.ext for every file x.ext listed above.
The default for $ZAP_BASE is http://adzapper.sourceforge.net/zaps. If you set the $ZAP_MODE variable to "CLEAR" then you will naturally want files named ad-clear.gif, closepopup-clear.html, no-op-clear.html, etcetera.
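The x-clear.ext naming rule is easy to get wrong by hand, so here is a small sketch that lists the file names a $ZAP_BASE directory would need in either mode. The helper names clear_name and list_needed are hypothetical and not part of the zapper distribution; the file list is the one given above.

```shell
# Replacement files expected under $ZAP_BASE (names taken from the text above).
ZAP_FILES="ad.gif adbg.gif ad.swf closepopup.html counter.gif no-op.html no-op.js webbug.gif"

clear_name() {
    # Map x.ext to x-clear.ext, e.g. ad.gif becomes ad-clear.gif.
    echo "${1%.*}-clear.${1##*.}"
}

list_needed() {
    # $1 is the mode: "CLEAR" lists the clear variants, anything else the normal names.
    for f in $ZAP_FILES
    do
        if [ "$1" = CLEAR ]
        then
            clear_name "$f"
        else
            echo "$f"
        fi
    done
}
```

Running list_needed CLEAR prints one name per line, which can be compared against a directory listing before pointing $ZAP_BASE at it.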
You can replace classes of ad with specific replacements. The following classes are known: AD for inlined images, ADHTML for separate HTML pages inserted as an ad (usually via FRAME, IFRAME or ILAYER tags), ADJS for javascript programs used to generate ads, ADBG for background images containing ads, ADSWF for ads implemented as Shockwave animations, ADMP3 for ads implemented as MP3 audio, ADPOPUP for those mega-annoying ads which pop up on their own as new web pages, COUNTER for inlined visitor count images and WEBBUG for web bugs. Each of these words matches the keyword on the start of the lines in the configuration file. To control each you would set the variable $STUBURL_class to the URL of the specific replacement for that class.
For example, setting
STUBURL_AD=http://adzapper.sourceforge.net/zaps/ad-clear.gif
would cause the inlined images to be the "clear" version while leaving the other classes as normal. That ad-clear.gif is a transparent single pixel GIF donated by David Finster <dfinster@airmail.net>. Another image you might like is http://adzapper.sourceforge.net/zaps/ad-grey.gif, from Andrew Dalgleish <andrewd@axonet.com.au>, which is a low contrast replacement image which lets you see what's zapped without it standing out so much.
Chris Lightfoot <chris@ex-parrot.com> wrote asking if I could make the zapper friendly to setups where people chain multiple redirection programs together (for example, to run both the ad zapper and another tool like SquidGuard). Then Adam Hope <a.hope@csl.gov.uk> wrote to say that they were chaining to another redirector which wanted the full 4 word input a redirector may expect.
The specification for the redirectors says unredirected URLs should be indicated with a blank line, which is no good for piping the output of one into the next. Accordingly, to chain redirectors a wrapper program is needed to pass URLs to each redirector in turn.
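The chaining logic can be sketched as follows. This is a simplified illustration, not the zapchain script itself: the two redirectors are hypothetical shell functions called once per URL (real redirectors are long-running processes), and the choice to feed each rewritten URL on to the next redirector is an assumption about how a chain might reasonably behave.

```shell
# Two stand-in redirectors (hypothetical). Each prints a new URL,
# or a blank line meaning "no change".
zap_ads() {
    case "$1" in
        *-banner.gif) echo http://adzapper.sourceforge.net/zaps/ad.gif ;;
        *)            echo ;;
    esac
}

force_www() {
    case "$1" in
        http://example.com/*) echo "http://www.example.com/${1#http://example.com/}" ;;
        *)                    echo ;;
    esac
}

chain_one() {
    # $1 is the original URL; the remaining arguments name the redirectors, in order.
    orig=$1
    url=$1
    shift
    for r in "$@"
    do
        new=$("$r" "$url")
        # A blank reply means "unchanged", so keep the current URL.
        if [ -n "$new" ]
        then
            url=$new
        fi
    done
    # Reply with a blank line if no redirector changed anything.
    if [ "$url" = "$orig" ]
    then
        echo
    else
        echo "$url"
    fi
}
```

For example, chain_one http://example.com/page.html zap_ads force_www passes the URL through both functions and prints the www form, while a URL neither function recognises yields a blank line.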
To chain redirectors:
chmod a+rx scripts...
exec "$zapper"
# exec /path/to/zapchain "$zapper" /path/to/another/eg/squirm
to:
# exec "$zapper"
exec /path/to/zapchain "$zapper" /path/to/another/eg/squirm
and adjust the pathnames to suit. You may name as many different redirectors as you like, not just two.
redirect_program /path/to/squid_redirect
to be:
redirect_program /path/to/wrapzap
Note: if you keep your own set of extra patterns, see the customisation section - in particular the section on extra pattern files - for how to use the wrapzap script to keep these additions separate, so as not to be overwritten by the script update.
You have several choices about keeping up to date with the patterns (and the matching squid_redirect).
0 0 * * * /usr/local/etc/update-zapper
(Hack "/usr/local/etc/update-zapper" to match wherever you install the automatic update script.) That should go in the crontab of some user with write permission to the squid_redirect script as installed on your squid host and permission to send signals to the zapping squid daemon. Probably this user is called "squid", but this would work as root too. Then install the update-zapper script. The script needs the wget program.
If you have to support more than a few users, you may want to use a proxy.pac file. This is a file containing a JavaScript function used by a browser to decide which proxy to use (if any) on a per-URL basis. This is often known as "automatic proxy configuration", as all you tell the browser configuration is the URL of the proxy.pac file. Once you've set this up for each of your users, you can then control things by editing the central file. Both Netscape and Internet Explorer support proxy.pac files.
Thus it may become a case of "do it yourself". However, at least in my case, ZipWorld were happy enough to up my disc limit a bit, let me run the zapper all the time (even when not logged in), and automate a monthly post to a local newsgroup to tell people about the zapper. Very cool!
Something to bear in mind if you implement this for an ISP (or anywhere where the zapper isn't behind a firewall): to avoid having their site hammered, ZipWorld asked me to limit access to the zapper at Zip to the list of IP address ranges that they own. To this end the ranges are in a file and the squid config for the zapper there says:
acl zipworldIP src "/home/cs/rc/squid/ip-ranges@zip"
acl zipworldDNS srcdomain zipworld.com.au zipworld.net.au zip.com.au zip.net.au zipworld.net pacific.net.au
I also customised the ERR_ACCESS_DENIED page that squid returns for unauthorised access.
http://adzapper.sourceforge.net/rc/proxy-zip.pac
Users of other ISPs can contact me for details on how I set this up.
Here are a few example .pac files which I've set up for various sites. Each would require some customisation for your own site.
Ad sites commonly set permanent cookies on your browser. Via use of the HTTP_REFERER header they can then often track your activities on all sites they advertise on, and some companies advertise across a very wide range of popular sites. It wouldn't be unusual for them to get personal information on you just from the URLs in the HTTP_REFERER field, which may include form values including your name, things you like to search for, etc.
He says this was his primary motivation for zapping doubleclick ads in 1996. I'd remark that while an ad zapper protects you from this (cookies attached to the inlined image), naturally if you follow the ad link anyway (since the savvy marketer will add a useful descriptive caption under the banner, permitting you to know what the zapped ad was for) then you're on your own.
See also - Cookie RFC (privacy section, but note that the popular browsers don't follow the guidelines!):
http://www.cis.ohio-state.edu/htbin/rfc/rfc2109.html
Similarities:
Both filter out those annoying advertisement pages that waste time and bandwidth, meaning money (we're paying for that!). Both use a list of sites and regular expressions to eliminate these advertisements. Both redirect the image to a default, smaller image.
That's where the similarities end.
Differences:
Ad Zapper integrates much more nicely into squid. It is started from within squid (as many processes as you like) and is basically a URL redirector based on regular expressions that are contained inside the script.
Junkbuster runs as a separate daemon, and you have to use it as a hierarchical cache, with junkbuster as either the parent or child. I found having it as the parent (they document how to set it up as a child in the docs) to be the superior configuration. All fetches from an external web page must be redirected through junkbuster - which is quite slow compared to squid. Also the double handling makes for a slower transaction.
Ad Zapper zaps ads - that's it. Junkbuster can also filter out cookies and web pages (like those annoying small ones that advertise the free web pages the site is from). I have found junkbuster to be a little too constrictive. It can also do web anonymity and return wafers instead of cookies for you with "leave me alone" privacy messages in them for the web administrators.
My recommendation is this: If you want tight security then go for junkbuster. You're sacrificing some speed and some pages which simply won't load anymore since the pattern matching tries too hard. If you want performance without ads, go for Ad Zapper (you can even specify your own image, which you can't do with junkbuster).
I've noticed that the recent squid release (2.2STABLE4 as I type this) has anonymising facilities, so you can perhaps use those in conjunction with squid_redirect to get what you want.
http_port 8080 8081
Then you say that only accesses to one of the ports use the redirector:
acl nobannerport myport 8080
redirector_access allow nobannerport
That way people using port 8080 will get the zapping service and people using port 8081 will get the raw, uglified web.
We have a double squid cache (once on the same machine, now on separate machines). The usual proxy for users is:
proxy:8080
which has no cache and the URL redirector in its config:
redirect_program /opt/UCSDsquid/bin/squid_redirect
This lives off the main, non-redirecting cache at:
proxy-raw:8080
which has a big cache. The proxy.pac file users use points them at:
PROXY proxy-noads:8080; PROXY proxy-raw:8080
and the proxy-raw.pac (which shows ads) says:
PROXY proxy-raw:8080
The CNAMEs proxy-noads and proxy-raw point at the zapping and nonzapping squids, respectively. The CNAME proxy points at the same machine proxy-noads does. That way the naive and memorable setup gets a zapped view of the web. If your site policy is different you can just point proxy at the nonzapping machine and publicise the zapper as an optional service.
Basic checks:
chmod a+rx scripts...
telnet proxy.your.isp.domain 8080
If you don't get a connection, try port 3128 instead of 8080.
GET http://www.zip.com.au/~cs/ HTTP/1.0
and press return twice. You should get an HTTP response (code 200 hopefully), some header lines, then some HTML. If you don't then that's not your ISP's proxy service, and you must contact them to find out the correct details.
cache_peer 203.12.172.230 parent 8080 3130 no-query default
You would replace the "203.12.172.230" with the name of your ISP's proxy (eg "proxy.your.isp.domain") and the 8080 with the matching port number (probably the same).
netstat -an | grep -i listen
to check that squid (presumably) is listening on port 8080 on your machine.
telnet localhost 8080
to check, and issue the same GET command you used above to fetch a web page.
the-script </dev/null
That should do nothing, with no complaints. If this is greeted with messages like:
the-script: exec failed: No such file or directory
then you may have spurious CR characters in there. You can verify this with the command:
sed 1q the-script | od -c
which will print:
0000000 # ! / u s r / b i n / p e r l \n
0000020
for a good script and:
0000000 # ! / u s r / b i n / p e r l \r
0000020 \n
for a bad script (note that extra \r, which is a carriage return (CR)). These can be deleted with the tr command, viz:
tr -d '\015' <the-script >the-script.fixed
mv the-script.fixed the-script
which makes a new copy without the CRs and then replaces the original with the new one. The dos2unix(1) command can also be used for this task, if available.